Creators/Authors contains: "Ni, Peng"

  1. Abstract

    The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. In particular, PECAT achieves a nearly haplotype-resolved assembly of B. taurus (Bison×Simmental) using Nanopore R9 reads and a phase block NG50 of 59.4/58.0 Mb for HG002 using Nanopore R10 reads.
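
The phase block NG50 quoted above follows the usual NG50 convention: the length of the block at which the cumulative length of blocks, taken from longest to shortest, first reaches half of the genome size. A minimal sketch under that standard definition is given below; the function name `ng50` and the toy block lengths are illustrative assumptions, not part of PECAT.

```python
def ng50(block_lengths, genome_size):
    """Return NG50 for a set of block (or contig) lengths.

    NG50 is the length L of the block at which the running total of
    lengths, sorted from longest to shortest, first covers at least
    half of genome_size. Returns 0 if the blocks never cover half.
    """
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Toy example (hypothetical numbers, not data from the abstract).
blocks = [50, 40, 30, 20, 10]
print(ng50(blocks, genome_size=200))
```

Unlike N50, which uses the total assembled length as the denominator, NG50 uses a fixed genome size, so values stay comparable across assemblies of the same genome.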

     
  2. Abstract

    Long-read single-molecule sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using PacBio CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction-treated and M.SssI-methyltransferase-treated DNA of one human sample using PacBio CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the Curve on 5mCpG detection at single-molecule resolution. At the genome-wide site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and nanopore sequencing using only 10× reads. Furthermore, we develop a Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation using CCS reads, and then sequence a Chinese family trio to validate it. ccsmeth and ccsmethphase can be robust and accurate tools for detecting DNA 5-methylcytosines.

     
  3. Birol, Inanc (Ed.)
    Abstract Motivation: Oxford Nanopore sequencing has great potential and advantages in population-scale studies. Due to the cost of sequencing, the depth of whole-genome sequencing per individual sample must be small. However, the existing single nucleotide polymorphism (SNP) callers are aimed at high-coverage Nanopore sequencing reads. Detecting SNP variants in low-coverage Nanopore sequencing data is still a challenging problem. Results: We developed a novel deep learning-based SNP calling method, NanoSNP, to identify the SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. In this method, we design a multi-step, multi-scale and haplotype-aware SNP detection pipeline. First, the pileup model in NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a bidirectional long short-term memory (Bi-LSTM) network. These SNP sites are phased and used to divide the low-coverage Nanopore reads into different haplotypes. Finally, the long-range haplotype feature and short-range pileup feature are extracted from each haplotype. The haplotype model combines the two features and predicts the genotype for the candidate site using a Bi-LSTM network. To evaluate the performance of NanoSNP, we compared NanoSNP with Clair, Clair3, Pepper-DeepVariant and NanoCaller on low-coverage (∼16×) Nanopore sequencing reads. We also performed cross-genome testing on six human genomes, HG002–HG007. Comprehensive experiments demonstrate that NanoSNP outperforms Clair, Pepper-DeepVariant and NanoCaller in identifying SNPs on low-coverage Nanopore sequencing data, including the difficult-to-map regions and major histocompatibility complex regions in the human genome. NanoSNP is comparable to Clair3 when the coverage exceeds 16×. Availability and implementation: https://github.com/huangnengCSU/NanoSNP.git. Supplementary information: Supplementary data are available at Bioinformatics online.
  4. Abstract Mineral/melt partition coefficients have been widely used to provide insights into magmatic processes. Olivine is one of the most abundant and important minerals in the lunar mantle and mare basalts. Yet, no systematic olivine/melt partitioning data are available for lunar conditions. We report trace element partition data between host mineral olivine and its melt inclusions in lunar basalts. Equilibrium is evaluated using the Fe-Mg exchange coefficient, leading to the choice of melt inclusion-host olivine pairs in lunar basalts 12040, 12009, 15016, 15647, and 74235. Partition coefficients of 21 elements (Li, Mg, Al, Ca, Ti, V, Cr, Mn, Fe, Co, Y, Zr, Nb, Gd, Tb, Dy, Ho, Er, Tm, Yb, and Lu) were measured. Except for Li, V, and Cr, these elements show no significant difference in olivine-melt partitioning compared to the data for terrestrial samples. The partition coefficient of Li between olivine and melt in some lunar basalts with low Mg# (Mg# < 0.75 in olivine, or < ~0.5 in melt) is higher than published data for terrestrial samples, which is attributed to the dependence of D_Li on Mg# and the lack of literature D_Li data with low Mg#. The partition coefficient of V in lunar basalts is measured to be 0.17 to 0.74, significantly higher than that in terrestrial basalts (0.003 to 0.21), which can be explained by the lower oxygen fugacity in lunar basalts. The significantly higher D_V can explain why V is less enriched in evolved lunar basalts than terrestrial basalts. The partition coefficient of Cr between olivine and basalt melt in the Moon is 0.11 to 0.62, which is lower than those in terrestrial settings by a factor of ~2. This is surprising because previous authors showed that Cr partition coefficient is independent of fO2. A quasi-thermodynamically based model is developed to correlate Cr partition coefficient to olivine and melt composition and fO2. The lower Cr partition coefficient between olivine and basalt in the Moon can lead to more Cr enrichment in the lunar magma ocean, as well as more Cr enrichment in mantle-derived basalts in the Moon. Hence, even though Cr is typically a compatible element in terrestrial basalts, it is moderately incompatible in primitive lunar basalts, with a similar degree of incompatibility as V based on partition coefficients in this work, as also evidenced by the relatively constant V/Cr ratio of 0.039 ± 0.011 in lunar basalts. The confirmation of constant V/Cr ratio is important for constraining concentrations of Cr (slightly volatile and siderophile) and V (slightly siderophile) in the bulk silicate Moon.
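
For readers unfamiliar with the quantities in this abstract, the conventional definitions can be written as follows. This is a sketch using standard petrological notation; the symbols C (concentration) and X (molar abundance) are assumptions of this note and are not taken from the abstract, and Fe is conventionally counted as ferrous iron in the exchange coefficient.

```latex
% Partition coefficient of element i between olivine (ol) and melt
D_i^{\mathrm{ol/melt}} = \frac{C_i^{\mathrm{ol}}}{C_i^{\mathrm{melt}}}

% Fe-Mg exchange coefficient used to evaluate olivine-melt equilibrium
K_D^{\mathrm{Fe\text{-}Mg}}
  = \frac{X_{\mathrm{Fe}}^{\mathrm{ol}} / X_{\mathrm{Mg}}^{\mathrm{ol}}}
         {X_{\mathrm{Fe}}^{\mathrm{melt}} / X_{\mathrm{Mg}}^{\mathrm{melt}}}
  = \frac{D_{\mathrm{Fe}}^{\mathrm{ol/melt}}}{D_{\mathrm{Mg}}^{\mathrm{ol/melt}}}
```

An element with D < 1 is incompatible (it concentrates in the melt) and one with D > 1 is compatible, which is the sense in which the reported Cr values of 0.11 to 0.62 make Cr moderately incompatible in these lunar basalts.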
  5. Abstract Long-read sequencing technology enables significant progress in de novo genome assembly. However, the high error rate and the wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure to fix errors in the draft assembly and improve the reliability of genomic analysis. However, existing methods treat all the regions of the assembly equally while there are fundamental differences between the error distributions of these regions. How to achieve very high accuracy in genome assembly is still a challenging problem. Motivated by the uneven errors in different regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into blocks with low complexity and high complexity according to statistics of aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Due to the different distributions of error rates in trivial and complex blocks, two multitask bidirectional long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In the whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish has a higher polishing accuracy than other state-of-the-art methods including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels and BlockPolish performs well in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce the errors in the PacBio assemblies. The source code of BlockPolish is freely available on GitHub (https://github.com/huangnengCSU/BlockPolish).
  6. Robinson, Peter (Ed.)
    Abstract Motivation: Oxford Nanopore sequencing, which produces long reads at low cost, has made many breakthroughs in genomics studies. However, the large number of errors in Nanopore genome assembly affects the accuracy of genome analysis. Polishing is a procedure to correct the errors in genome assembly and can improve the reliability of the downstream analysis. However, the performance of the existing polishing methods is still not satisfactory. Results: We developed a novel polishing method, NeuralPolish, to correct the errors in assemblies based on alignment matrix construction and orthogonal Bi-GRU networks. In this method, we designed an alignment feature matrix for representing read-to-assembly alignment. Each row of the matrix represents a read, and each column represents the aligned bases at each position of the contig. In the network architecture, a bi-directional GRU network is used to extract the sequence information inside each read by processing the alignment matrix row by row. After that, the feature matrix is processed by another bi-directional GRU network column by column to calculate the probability distribution. Finally, a CTC decoder generates a polished sequence with a greedy algorithm. We used five real datasets and three assembly tools including Wtdbg2, Flye and Canu for testing, and compared the results of different polishing methods including NeuralPolish, Racon, MarginPolish, HELEN and Medaka. Comprehensive experiments demonstrate that NeuralPolish achieves more accurate assembly with fewer errors than other polishing methods and can improve the accuracy of assembly obtained by different assemblers. Availability and implementation: https://github.com/huangnengCSU/NeuralPolish.git. Supplementary information: Supplementary data are available at Bioinformatics online.
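
The final step described above, greedy CTC decoding, can be illustrated generically: take the most probable class at each position, collapse consecutive repeats, then drop the blank symbol. The sketch below is not NeuralPolish's code; the class layout (blank at index 0 followed by A, C, G, T), the function name and the toy probabilities are all assumptions made for illustration.

```python
BLANK = 0          # assumed index of the CTC blank class
ALPHABET = "ACGT"  # assumed classes 1..4, in this order


def ctc_greedy_decode(prob_rows):
    """Greedy CTC decoding over per-position class probabilities.

    prob_rows: list of lists, one row per position, each row holding
    probabilities for [blank, A, C, G, T] (an assumed layout).
    """
    decoded = []
    previous = None
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax class
        # Emit only when the class changes and is not the blank symbol.
        if best != previous and best != BLANK:
            decoded.append(ALPHABET[best - 1])
        previous = best
    return "".join(decoded)


def peaked(cls, n_classes=5):
    """Build a toy probability row with most mass on class `cls`."""
    row = [0.025] * n_classes
    row[cls] = 0.9
    return row


# Argmax path: A A blank A C blank blank T
toy = [peaked(c) for c in (1, 1, 0, 1, 2, 0, 0, 4)]
print(ctc_greedy_decode(toy))  # prints "AACT"
```

The blank between the second and third "A" is what lets the decoder output a genuine repeated base, which matters for homopolymer runs where polishing errors tend to concentrate.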
  7. Abstract

    In plants, cytosine DNA methylation (5mC) can occur in three sequence contexts, CpG, CHG, and CHH (where H = A, C, or T), which play different roles in the regulation of biological processes. Although long Nanopore reads are advantageous in the detection of 5mCs compared to short-read bisulfite sequencing, existing methods can only detect 5mCs in the CpG context, which limits their application in plants. Here, we develop DeepSignal-plant, a deep learning tool to detect genome-wide 5mCs of all three contexts in plants from Nanopore reads. We sequence Arabidopsis thaliana and Oryza sativa using both Nanopore and bisulfite sequencing. We develop a denoising process for training models, which enables DeepSignal-plant to achieve high correlations with bisulfite sequencing for 5mC detection in all three contexts. Furthermore, DeepSignal-plant can profile more 5mC sites, which will help to provide a more complete understanding of epigenetic mechanisms of different biological processes.

     
  8. Abstract Subducting tectonic plates carry water and other surficial components into Earth’s interior. Previous studies suggest that serpentinized peridotite is a key part of deep recycling, but this geochemical pathway has not been directly traced. Here, we report Fe-Ni–rich metallic inclusions in sublithospheric diamonds from a depth of 360 to 750 km with isotopically heavy iron (δ⁵⁶Fe = 0.79 to 0.90‰) and unradiogenic osmium (¹⁸⁷Os/¹⁸⁸Os = 0.111). These iron values lie outside the range of known mantle compositions or expected reaction products at depth. This signature represents subducted iron from magnetite and/or Fe-Ni alloys precipitated during serpentinization of oceanic peridotite, a lithology known to carry unradiogenic osmium inherited from prior convection and melt depletion. These diamond-hosted inclusions trace serpentinite subduction into the mantle transition zone. We propose that iron-rich phases from serpentinite contribute a labile heavy iron component to the heterogeneous convecting mantle eventually sampled by oceanic basalts.
  9. Abstract

    Iron is ubiquitous in the Solar System, and its isotopic composition varies widely among planetary reservoirs. Such variations reflect the diverse range of differentiation and evolution processes experienced by their parent bodies. A key to deciphering iron isotope variations among planetary samples is to understand how iron isotopes fractionate during core formation. Here we report new Nuclear Resonant Inelastic X‐ray Scattering experiments on silicate glasses of bulk silicate Earth compositions to measure their force constants at high pressures of up to 30 GPa. The force constant results are subsequently used to constrain iron isotope fractionation during core formation on terrestrial planets. Using a model that integrates temperature, pressure, core composition, and redox state of the silicate mantle, we show that core formation might lead to an isotopically light mantle for small planetary bodies but a heavy one for Earth‐sized terrestrial planets.

     